Abstract - Investigations of the complexity of sequences lead to important
applications such as effective data compression, testing of randomness,
discriminating between information sources and many others. In this paper we
establish formulas describing the distribution functions of random variables
representing the complexity of finite sequences introduced by Lempel and Ziv
in 1976. We show that the distribution functions depend in an affine way on
the probabilities of the so-called "exact" sequences.

Keywords : Complexity of sequence, distribution function, combinatorial 
problems, Lempel-Ziv parsing algorithms, randomness 



I. Introduction 

The notion of complexity of a given sequence was first introduced in papers by 
Kolmogorov [3] and Chaitin [1]. Kolmogorov proposed to use the length of the 
shortest binary program which, when fed into a given algorithm, will cause it to 
produce a specified sequence, as a measure for the complexity of that sequence with 
respect to the given algorithm. If the length of the program is large we can say that 
the complexity of the sequence is large. 

In 1976 Lempel and Ziv [4] proposed and explored another approach to the
problem of the complexity of a specific sequence. They linked the complexity of
a specific sequence to the gradual buildup of new patterns along the given
sequence. The complexity measure suggested by them is related to the number of
distinct phrases and the rate of their occurrence along the sequence. It
reflects the behaviour of a simple parsing algorithm whose task is to recognize
newly encountered phrases during its scanning of a given sequence. In a series
of papers, modifications of the Lempel-Ziv parsing algorithm were proposed in
response to the needs of various applications. In general, in these algorithms
a new phrase is established as the shortest substring which has not occurred
previously, where the search for previous occurrences may be restricted or
generalized in the modified algorithms in various ways, e.g. by considering
only a fixed number of preceding symbols [8], by considering only complete
previously established phrases (Lempel-Ziv Incremental Parsing Algorithm [9]),
by allowing a number (not more than a fixed threshold) of previous occurrences
of the phrase (Generalized Lempel-Ziv Algorithm [6]), etc.

It turned out that investigations of sequence complexity play an important role 
in universal data compression schemes and their numerous applications such as 
efficient transmission of data [8], [9], tests of randomness [16], discriminating between 
information sources [2], [10], estimating the statistical model of individual sequences 
[10] and many others. 

In this paper we introduce the concept of exact sequences, i.e. sequences in
which the last phrase of the sequence does not occur in the past (precise
formulation: Def. 3). We derive formulas describing the distribution function
of random variables representing the complexity of finite sequences as defined
by Lempel and Ziv in 1976. These formulas turn out to be of affine form with
respect to the probabilities of exact sequences.

II. Lempel-Ziv complexity 
In this section we introduce the notation and recall basic definitions [4]. 
Let $A$ be a finite alphabet and let $\alpha = |A|$ denote the size of the
alphabet. Let $A^n$ be the set of all sequences of length $n$ over $A$ and let
$S = s_1 s_2 \cdots s_n$ be an arbitrary element of $A^n$. By $S(i,j)$ we
denote the substring $s_i s_{i+1} \cdots s_j$ of $S$ when $i \le j$, and
$S(i,j) = \Lambda$ (the empty string) when $j < i$. The partition

$$H(S) = S(1,h_1)\,S(h_1+1,h_2) \cdots S(h_{m-1}+1,n) \qquad (1)$$

of $S$ such that for every $i$, $S(h_{i-1}+1, h_i-1)$ is a substring of
$S(1, h_i-2)$, is called the history of $S$, and the $m$ strings
$H_i(S) = S(h_{i-1}+1, h_i)$, $i = 1, 2, \ldots, m$, where $h_0 = 0$ and
$h_m = n$, are called the components of the history. (Note that $h_1 = 1$.)
Let $c_H(S)$ denote the number of components in a history $H(S)$ of $S$.

Definition 1: The complexity $c(S)$ of the sequence $S$ is the number

$$c(S) = \min\{c_H(S)\} \qquad (2)$$

where the minimum is over all histories of $S$.

Definition 2: The component $H_i(S) = S(h_{i-1}+1, h_i)$ is called exhaustive
if this string does not appear in the string $S(1, h_i-1)$. A history of $S$ is
called exhaustive if each of its components, except possibly the last one, is
exhaustive.

It is easy to see that every sequence has a unique exhaustive history, denoted
by $H_E(S)$. For instance, the exhaustive history of the sequence
S = 0011011101110110 is given by the following parsing of S:
0, 01, 10, 111, 01110110, where successive components are separated by commas.
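The parsing described above can be sketched in a few lines of Python. This is only an illustrative reading of Definition 2, not code from the paper: the function names `exhaustive_history` and `complexity` are assumptions, and the first test string is an arbitrary illustrative input that does not come from the text.

```python
def exhaustive_history(s: str) -> list[str]:
    """Return the components of the exhaustive history of the string s."""
    components = []
    start, n = 0, len(s)
    while start < n:
        end = start + 1
        # Grow the candidate s[start:end] while it already occurs in the
        # sequence read so far with its last symbol removed, i.e. in s[:end-1].
        # The loop also stops at the end of s, so the last component
        # is allowed to be non-exhaustive.
        while end < n and s[start:end] in s[:end - 1]:
            end += 1
        components.append(s[start:end])
        start = end
    return components


def complexity(s: str) -> int:
    """Number of components of the exhaustive history (hypothetical helper)."""
    return len(exhaustive_history(s))


# Illustrative input (an assumption, not from the text): six components.
print(exhaustive_history("0001101001000101"))
# The 16-symbol sequence used in the text: five components.
print(complexity("0011011101110110"))
```

Each component is found by a naive substring search, so the sketch runs in roughly cubic time in the worst case; it is meant for checking small examples by hand, not for long sequences.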

Remark 1: It was proved in [4] that $c(S) = c_E(S)$, where $c_E(S)$ is the
number of components in $H_E(S)$. Thus, below we shall use $c_E$ as the
definition of complexity.

Definition 3: The sequence $S = s_1 s_2 \cdots s_n$ is called exact if the last
string $S(h_{m-1}+1, n)$ in its exhaustive history
$H_E(S) = S(1,h_1)\,S(h_1+1,h_2) \cdots S(h_{m-1}+1,n)$ does not occur as a
substring $S(i,j)$ (where $1 \le i \le j \le n-1$) in the sequence
$S(1, n-1) = s_1 \cdots s_{n-1}$.

From now on we shall assume that for a fixed $n$ all elements of $A^n$ are
equiprobable, i.e. we assign the same probability $\alpha^{-n}$ to each element
of $A^n$, and

$$P_n : 2^{A^n} \to [0, 1] \qquad (3)$$

denotes the probability in this sense. By $P_n(k)$ we denote the probability of
the event consisting of all sequences of length $n$ and complexity $k$, while
$P_n(k_e)$ is the probability of the event consisting of all exact sequences of
length $n$ and complexity $k$.

Under the above assumptions, for every $n \in \mathbb{N}$ we define the random
variable $C_n : A^n \to \mathbb{N}$ representing the complexity:

$$C_n(S) := c_E(S) \qquad (4)$$

for every sequence $S \in A^n$.
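For small $n$ the two probabilities $P_n(k)$ and $P_n(k_e)$ can be tabulated by brute-force enumeration of $A^n$. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the alphabet is passed as a string of single-character symbols, and the binary alphabet in the usage line is an arbitrary choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def exhaustive_history(s):
    """Components of the exhaustive history of s (Definition 2)."""
    components, start, n = [], 0, len(s)
    while start < n:
        end = start + 1
        while end < n and s[start:end] in s[:end - 1]:
            end += 1
        components.append(s[start:end])
        start = end
    return components


def complexity_probabilities(n, alphabet="01"):
    """Return two dicts keyed by k: P_n(k) and P_n(k_e), as exact fractions."""
    all_counts, exact_counts = Counter(), Counter()
    for symbols in product(alphabet, repeat=n):
        s = "".join(symbols)
        history = exhaustive_history(s)
        k = len(history)
        all_counts[k] += 1
        # Definition 3: s is exact if its last component does not occur
        # as a substring of s_1 ... s_{n-1}.
        if history[-1] not in s[:-1]:
            exact_counts[k] += 1
    size = len(alphabet) ** n  # every sequence has probability 1 / size
    p = {k: Fraction(c, size) for k, c in all_counts.items()}
    p_exact = {k: Fraction(c, size) for k, c in exact_counts.items()}
    return p, p_exact


p, p_exact = complexity_probabilities(3)
print(p)        # P_3(k) for each complexity k that occurs
print(p_exact)  # P_3(k_e) for each k with at least one exact sequence
```

Values of $k$ that do not occur are simply absent from the dictionaries, which corresponds to probability zero. The enumeration visits all $\alpha^n$ sequences, so it is practical only for short lengths.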

III. The distribution function of $C_n$
In this section we describe the distribution function of $C_n$,
$n \in \mathbb{N}$. We prove the following

Theorem: Under the above notation,

$$P_{n+1}(C_{n+1} \le k) = 1 - \sum_{r=1}^{n} P_r(k_e) \qquad (5)$$

for every $n, k \in \mathbb{N}$.

Proof: We first express $P_{n+1}(k+1)$ in terms of $P_n$.
By the definition of $P_n$ we find that:

- the number of sequences with complexity $k+1$ and length $n$ is
  $\alpha^n P_n(k+1)$,

- the number of exact sequences with complexity $k+1$ and length $n$ is
  $\alpha^n P_n((k+1)_e)$,

- the number of exact sequences with complexity $k$ and length $n$ is
  $\alpha^n P_n(k_e)$.

Taking into account the definitions of complexity and exact sequences we
conclude that every sequence with complexity $k+1$ and length $n+1$ can be
obtained from a sequence of length $n$ in one of the following two ways only:

- by adding a symbol to a sequence with complexity $k+1$ which is not exact,

- by adding a symbol to an exact sequence with complexity $k$.

We also see that all sequences obtained from exact sequences of length $n$ and
complexity $k+1$ by adding a symbol from $A$ will increase their complexity to
$k+2$, and the number of such sequences is $\alpha \cdot \alpha^n P_n((k+1)_e)$.
From the definition of $P_{n+1}(k+1)$ and the above observations we conclude
that

$$P_{n+1}(k+1) = \frac{\alpha \cdot \alpha^n P_n(k+1) - \alpha \cdot \alpha^n P_n((k+1)_e) + \alpha \cdot \alpha^n P_n(k_e)}{\alpha^{n+1}} \qquad (6)$$

and thus

$$P_{n+1}(k+1) = P_n(k+1) + P_n(k_e) - P_n((k+1)_e) \qquad (7)$$

for every $n, k \in \mathbb{N}$.
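As a quick sanity check of (7), one can work through the smallest non-trivial case by hand. The values below assume a binary alphabet ($\alpha = 2$) with $n = 2$ and $k = 2$; these are illustrative choices, not values taken from the text.

```latex
% Binary alphabet, n = 2, k = 2 (illustrative values).
% Length 2: 00, 01, 10, 11 all have complexity 2; only 01 and 10 are exact.
% Length 3: 010, 011, 100, 101 have complexity 3; the other four have complexity 2.
\begin{align*}
P_2(3) &= 0, \qquad P_2(2_e) = \tfrac{2}{4} = \tfrac{1}{2}, \qquad P_2(3_e) = 0, \\
P_3(3) &= P_2(3) + P_2(2_e) - P_2(3_e) = 0 + \tfrac{1}{2} - 0 = \tfrac{1}{2} = \tfrac{4}{8}.
\end{align*}
```

The right-hand side agrees with the direct count of four length-3 sequences of complexity 3 out of eight.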



Replacing $n+1$ by $n$ we have

$$P_n(k+1) = P_{n-1}(k+1) + P_{n-1}(k_e) - P_{n-1}((k+1)_e). \qquad (8)$$

Substituting (8) into (7) we obtain

$$P_{n+1}(k+1) = P_{n-1}(k+1) + P_{n-1}(k_e) - P_{n-1}((k+1)_e) + P_n(k_e) - P_n((k+1)_e). \qquad (9)$$

We replace $n$ by $n-1$ in (8) and insert the result in (9). Continuing this
process we arrive at

$$P_{n+1}(k+1) = P_1(k+1) + \sum_{r=1}^{n} \left[ P_r(k_e) - P_r((k+1)_e) \right]. \qquad (10)$$

Since $P_1(k+1) = 0$ for $k \ge 1$ we have

$$P_{n+1}(k+1) = \sum_{r=1}^{n} P_r(k_e) - \sum_{r=1}^{n} P_r((k+1)_e) \qquad (11)$$

for every $k, n \in \mathbb{N}$.

Now, replacing in (11) $k$ by $k+1$, $k+2$, $k+3$, ..., $k+(n-k)-1$ we obtain

$$\begin{aligned}
P_{n+1}(k+2) &= \sum_{r=1}^{n} P_r((k+1)_e) - \sum_{r=1}^{n} P_r((k+2)_e), \\
P_{n+1}(k+3) &= \sum_{r=1}^{n} P_r((k+2)_e) - \sum_{r=1}^{n} P_r((k+3)_e), \\
&\;\;\vdots \\
P_{n+1}(n) &= \sum_{r=1}^{n} P_r((n-1)_e) - \sum_{r=1}^{n} P_r(n_e).
\end{aligned} \qquad (12)$$



Adding (11) and the above equations and taking into account the fact that
$\sum_{r=1}^{n} P_r(n_e) = 0$ for $n > 2$ we have

$$\sum_{s=1}^{n-k} P_{n+1}(k+s) = \sum_{r=1}^{n} P_r(k_e). \qquad (13)$$

One can easily see that $P_{n+1}(k+s) = 0$ for $s > n-k$, where
$n > k \ge 1$. Thus, we obtain the following expression for the distribution
function of $C_{n+1}$:

$$P_{n+1}(C_{n+1} \le k) = 1 - \sum_{r=1}^{n} P_r(k_e), \qquad (14)$$

which finishes the proof.
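The statement of the Theorem can be checked numerically for short sequences by enumerating both sides. The following is a hedged sketch rather than part of the proof: all function names are hypothetical, and the alphabets and length ranges are arbitrary small test cases chosen so that the enumeration stays fast.

```python
from fractions import Fraction
from itertools import product


def exhaustive_history(s):
    """Components of the exhaustive history of s (Definition 2)."""
    components, start, n = [], 0, len(s)
    while start < n:
        end = start + 1
        while end < n and s[start:end] in s[:end - 1]:
            end += 1
        components.append(s[start:end])
        start = end
    return components


def exact_probability(r, k, alphabet):
    """P_r(k_e): probability of an exact sequence of length r, complexity k."""
    hits = 0
    for symbols in product(alphabet, repeat=r):
        s = "".join(symbols)
        history = exhaustive_history(s)
        if len(history) == k and history[-1] not in s[:-1]:
            hits += 1
    return Fraction(hits, len(alphabet) ** r)


def distribution(n, k, alphabet):
    """P_n(C_n <= k), computed directly by enumeration."""
    hits = sum(
        1
        for symbols in product(alphabet, repeat=n)
        if len(exhaustive_history("".join(symbols))) <= k
    )
    return Fraction(hits, len(alphabet) ** n)


def theorem_holds(n, k, alphabet="01"):
    """Compare the two sides of the Theorem's identity for given n and k."""
    rhs = 1 - sum(exact_probability(r, k, alphabet) for r in range(1, n + 1))
    return distribution(n + 1, k, alphabet) == rhs


print(all(theorem_holds(n, k) for n in range(1, 8) for k in range(1, n + 2)))
```

Exact rational arithmetic via `fractions.Fraction` avoids rounding issues when comparing the two sides.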

Corollary 1: For every $n$ and $k$,

$$P_{n+1}(C_{n+1} \le k) = P_n(C_n \le k) - P_n(k_e). \qquad (15)$$

Proof: From (14) we have

$$1 - \sum_{r=1}^{n-1} P_r(k_e) = P_n(C_n \le k). \qquad (16)$$

Subtracting (16) from (14) we obtain (15).

Remark 2: It follows from the above corollary that
$P_{n+1}(C_{n+1} \le k) \le P_n(C_n \le k)$.

Corollary 2: From (14) and the fact that [4]

$$\lim_{n \to \infty} P_n(C_n \le k) = 0 \qquad (17)$$

we deduce that

$$\sum_{r=1}^{\infty} P_r(k_e) = 1. \qquad (18)$$

IV. Final Remarks 
The complexity of sequences was suggested as a statistical test of randomness
of random number generators and block ciphers [5], [7]. It was proved in [4]
that $\lim_{n \to \infty} P_n\left(C_n < \frac{n}{\log_\alpha n}\right) = 0$.
Therefore, the sets $K_{n,k} := \{S \in A^n : C_n(S) \le k\}$ seem to be good
candidates for critical sets (usually $k$ is assumed [7] to be
$\frac{n}{\log_\alpha n}$). This means, in fact, that for an arbitrarily chosen
probability $p$ close to $0$ there is $n_0$ such that for $n > n_0$, for a
given randomly chosen sequence $S$ the inequality
$C_n(S) < \frac{n}{\log_\alpha n}$ holds with probability less than $p$. Thus,
it is essential to estimate $P_n(K_{n,k}) = \sum_{s=1}^{k} P_n(s)$, i.e. the
levels of significance for $K_{n,k}$. In practice, for a fixed $n$ these sums
are computed numerically by finding all terms. Formula (15) makes it possible
to find the probability $P_{n+1}(K_{n+1,k})$ for sequences of length $n+1$ from
the probabilities $P_n(K_{n,k})$ and $P_n(k_e)$ for sequences of length $n$
(the latter two can be calculated simultaneously). This reduces the
computation time.
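The recursion (15) can be turned into a short routine that propagates $P_n(K_{n,k})$ from one length to the next. The sketch below is an assumption-laden illustration: the function names are hypothetical, $k \ge 1$ is assumed, and $P_n(k_e)$ is still obtained by plain enumeration, which is the expensive step a practical implementation would want to organize more carefully.

```python
from fractions import Fraction
from itertools import product


def exhaustive_history(s):
    """Components of the exhaustive history of s (Definition 2)."""
    components, start, n = [], 0, len(s)
    while start < n:
        end = start + 1
        while end < n and s[start:end] in s[:end - 1]:
            end += 1
        components.append(s[start:end])
        start = end
    return components


def exact_probability(n, k, alphabet):
    """P_n(k_e): probability of an exact sequence of length n, complexity k."""
    hits = 0
    for symbols in product(alphabet, repeat=n):
        s = "".join(symbols)
        history = exhaustive_history(s)
        if len(history) == k and history[-1] not in s[:-1]:
            hits += 1
    return Fraction(hits, len(alphabet) ** n)


def significance_levels(n_max, k, alphabet="01"):
    """Return [P_n(C_n <= k) for n = 1, ..., n_max], assuming k >= 1."""
    # Every one-symbol sequence has complexity 1, so P_1(C_1 <= k) = 1.
    levels = [Fraction(1)]
    for n in range(1, n_max):
        # Formula (15): P_{n+1}(C_{n+1} <= k) = P_n(C_n <= k) - P_n(k_e).
        levels.append(levels[-1] - exact_probability(n, k, alphabet))
    return levels


for n, level in enumerate(significance_levels(8, 3), start=1):
    print(n, level)
```

The returned list is non-increasing in $n$, in line with Remark 2.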